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[- — Abstract 

We investigate the role of the initialization for the stability of the k- 
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means clustering algorithm. As opposed to other papers, we consider the 
actual fc-means algorithm and do not ignore its property of getting stuck 
in local optima. We are interested in the actual clustering, not only in the 
costs of the solution. We analyze when different initializations lead to the 
same local optimum, and when they lead to different local optima. This 
enables us to prove that it is reasonable to select the number of clusters 
based on stability scores. 
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1 Introduction 

Stability is a popular tool for model selection in clustering, in particular to se- 
lect the number k of clusters. The general idea is that the best parameter k for 
a given data set is the one which leads to the "most stable" clustering results. 
While model selection based on clustering stability is widely used in practice, its 
behavior is still not well-understood from a theoretical point of view. A recent 
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Figure 1 : Different initial configurations and the corresponding outcomes of the 
fc- means algorithm. Figure a: the two boxes in the top row depict a data set with 
three clusters and four initial centers. Both boxes show different realizations of 
the same initial configuration. As can be seen in the bottom, both initializations 
lead to the same fc-means clustering. Figure b: here the initial configuration is 
different from the one in Figure a, which leads to a different fc-means clustering. 
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that one has access to an ideal algorithm which can globally optimize the fc- 
means criterion. For this perfect algorithm, results on stability are proved in 
the limit of the sample size n tending to infinity. However, none of these results 
applies to the fc-means algorithm as used in practice: they do not take into 
account the problem of getting stuck in local optima. In our current paper we 
try to overcome this shortcoming. We study the stability of the actual fc-means 
algorithm rather than the idealized one. 

Our analysis theoretically confirms the following intuition. Assume the data 
set has K well-separated clusters, and assume that fc-means is initialized with 
K' > K initial centers. We conjecture that when there is at least one initial 
center in each of the underlying clusters, then the initial centers tend to stay in 
the clusters they had been placed in. 

Consequently, the final clustering result is essentially determined by the number 
of initial centers in each of the true clusters (which we call the initial configura- 
tion), see Figure [I] for an illustration. In particular if one uses an initialization 
scheme which has the desired property of placing at least one center in each 
cluster with high probability, then the following will hold: If K' — K, we have 
one center per cluster, with high probability. The configuration will remain the 
same during the course of the algorithm. If K' > K, different configurations 
can occur. Since different configurations lead to different clusterings we obtain 
significantly different final clusterings depending on the random initialization, 
in other words we observe instability (w.r.t initialization) . 
Note that our argument does not imply stability or instability for K' < K. As 
we have less initial centers than clusters, for any initialization scheme there will 
be some clusters with no initial center. In this setting centers do move between 
clusters, and this cannot be analyzed without looking at the actual positions of 
the centers. Actually, as can be seen from examples, in this case one can have 
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cither stability or instability. 

The main point of our paper is that the arguments above can explain why the 
parameter fc selected by stability based model selection is often the true number 
of clusters, under the assumption that the data set consists of well separated 
clusters and one uses an appropriate initialization scheme. 

Even though the arguments above are very intuitive, even individual parts of 
our conjecture turn out to be surprisingly hard. In this paper we only go a 
first step towards a complete proof, considering mixtures of Gaussians in one 
dimension. For a mixture of two Gaussians (K = 2) we prove that the fc-means 
algorithm is stable for K' = 2 and instable for K' = 3. The proof technique 
is based on our configuration arguments outlined above. We also provide some 
preliminary results to study the general case, that is when the data space is R d 
and we do not make any parametric assumption on the probability distribution. 
Then we have a closer look at initialization schemes for fc-mcans, when K' > K. 
Is there an initialization scheme that will place at least one center in each true 
cluster w.h.p? Clearly, the naive method of sampling K' centers from the data 
set does not satisfy this property except for very small K. We study a standard 
but not naive initialization scheme and prove that it has the desirable property 
we were looking for. 

Of course there exist numerous other papers which study theoretical properties 
of the actual fc-means algorithm. However, these papers are usually concerned 
with the value of the fc-means objective function at the final solution, not with 
the position of the final centers. As far as we know, our paper is the first one 
which analyzes the "regions of attractions" of the different local optima of the 
actual fc-means algorithm and derives results on the stability of the fc-means 
clustering itself. 

2 Notation and assumptions 

In the following we assume that we are given a set of n data points X\,..., X n e R 
which have been drawn i.i.d. according to some underlying distribution P. For 
a center vector c = (ci, ...,Ck') with Cj € R d we denote the cluster induced by 
center with Cfc(c). The number of points in this cluster is denoted N k (c). The 
clustering algorithm we study in this paper is the standard fc-means algorithm. 
We denote the initial centers by cf a> , ...,c^, with cf 0> e R, and the centers 
after step t of the algorithm as cf* > , c^*, 5 *. By K we denote the true number 
of clusters, by K' the number of clusters constructed by the fc-means algorithm. 
It attempts to minimize the fc-mcans objective function 
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We now restate the fc-means algorithm: 
Input: X 1} ...,X n £ R d , K' £ IN 
Initialize the centers c< 0> , c$ > £ R d 
Repeat until convergence: 

1. Assign data points to closest centers. 

2. Re-adjust cluster means: 



KV ' i: x,ec t ( c <'>) t 



Output: c= (c< /ma/> ,...,c<( maZ> ). 

Traditionally, the instability of a clustering algorithm is defined as the mean 
(with respect to the random sampling of data points) minimal matching dis- 
tance between two clusterings obtained on two different set of data points. For 
the actual £;-means algorithm, a second random process is the random initial- 
ization (which has not been taken into account in previous literature). Here we 
additionally have to take the expectation over the random initialization when 
computing the stability of an algorithm. In this paper we will derive qualitative 
rather than quantitative results on stability, thus we omit more detailed formu- 
las. 



In the following we restrict our attention to the simple setting where the un- 
derlying distribution is a mixture of Gaussians on R and we have access to an 
infinite amount of data from P. In particular, instead of estimating means em- 
pirically when calculating the new centers of a fc-means step we assume access 
to the true means. In this case, the update step of the fc-means algorithm can 
be written as 

C <*+1> = fc k (c<*>) x f( x ) dx 

/c fc (c<*>) f( x ) dx 

where / is the density of the probability distribution P. Results in the finite 
data case can be derived by the help of concentrations inequalities. However, 
as this introduces heavy notation and our focus lies on the random initializa- 
tion rather than the random drawing of data points we skip the details. To 
further set up notation we denote <p M)(T the pdf of a Gaussian distribution with 
mean /i and variance a. We also denote f(x) — X^fe=i w kf^ k .a where K is the 
number of Gaussians, the weights Wk are positive and sum to one, the means 
fJ-i-.K — (Mi> • • • >M-ff) are ordered, /ii < . . . < fJ,K- The minimum separation 
between two Gaussians is denoted by A = min^ink+i — Mfe)- For the standard 
normal distribution we denote the pdf as ip and the cdf as 
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3 The level sets approach 



In this section we want to prove that if we run the fc-means algorithm with 
K ' = 2 and K' = 3 on a mixture of two Gaussians, then the resulting clustering 
depends exclusively on the initial configuration. More precisely if we initialize 
the algorithm such that each cluster gets at least one center and the initial cen- 
ters are "close enough" to the true cluster means, then during the course of the 
algorithm the initial centers do not leave the cluster they had been placed in. 
This implies stability for K ' — 2 since there is only one possible configuration 
satisfying this constraint. On the other hand for K' — 3 we have two possible 
configurations, and thus instability will occur. 

The following function plays an important role in our analysis: 

H : R 2 — ► R, H(x, y) = x$(-x + y) - <p(-x + y). 

Straightforward computations show that for any /i, cr, a and h one has 

h i s i \ 7 TT / a h + a — (i\ 

(x — n + a)ipa : a{x)dx = all —, . (2) 

oo a ) 

We describe necessary and sufficient conditions to obtain stability results for 
particular "regions" in terms of the level sets of H. 

3.1 Stability in the case of two initial centers 

We consider the square S a — \pi — a, fi 1 + a] x [/i 2 — a, fi2 + a] in R 2 . The region 
S a is called a stable region if 

c <0 > G S a C < x > G S a (3) 

Proposition 1 (Stable region for K' = 2) Equation is true if and only 
if the following four inequalities are satisfied: 

(a A \ fa + A A\ 
»mH (-, — )+ W2 H[ ,^ >0 (4) 



, a A\ /-o + A A', 

, mH — \ +W2 H[ — ,— <0 (5) 

a la I \ a la ; 



.'a- A A\ fa A', 

. W1 H{ ,-_ )+w 2 H (-,_ >0 (6) 

a la I \a la ; 



f-a-A A\ ( a A\ 

. W1 H( —)+w 2 H (—,-—)< (7) 

\ a la J \ a la J 

Proof. Similar to the proof of Proposition [3] see below. © 
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This proposition gives necessary and sufficient conditions for the stability of 
k- means in the case K' = 2. In the following corollary we show an example of 
the kind of result we can derive from Proposition [l] Note that the parameters 
a and A only appear relative to a. This allows us to consider an arbitrary a. 

Corollary 2 (Stability for K' = 2) Assume that min(wi, W2) = 0.2 and A = 

7a. Assume that we have an initialization scheme satisfying: 

• with probability at least 1 — 6 we have one initial center within 2.5a of /ii 
and one within 2.5a 0//Z2. 

Then k-means is stable in the sense that with probability at least 1—6 it converges 
to a solution with one center within 2.5a of fi\ and one within 2.5a of 

Proof. We simply check numerically that for a — 2.5a, A = la and w\ = 0.2 
(we also check u> 2 = 0.2) Equations Q - ^ are true. Then by Proposition [l] 
we know that S a is a stable region which implies the result. © 



3.2 Instability in the case of 3 centers 

The case of 3 centers gets more intricate. Consider the prism T a j, £ and its 
symmetric version sym(T a b, e ) in R 3 : 

T a ,b,e ={c 6 R 3 : Ci < C 2 < C3, 

c G \p% — a, /ii + a — s] x \pi — a + e, fi\ + a] x [/x 2 — b, fi 2 + b]} 
sym(T a<bte ) ={c e R 3 : c x < c 2 < c 3 , 

c G [fix — b, fii + b] x - a, ^2 + a - e] x - a + £, ^2 + a}}- 

If we have an initialization scheme such that each cluster gets at least one 
center and the initial centers are close enough to the true cluster means, then 
we initialize cither in T a ^.e or sym(T a _b tE ). Thus, if these regions arc stable in 
the following sense: 

c <0> G T aAe c <x> G T aA£ (8) 

then the global k-means algorithm will be instable, leading either to a clustering 
in T Qj f, j£ or sym(T a ^ yE ). Expressed in the terms used in the introduction, the 
algorithm will be initialized with different configurations and thus be instable. 

Proposition 3 (Stable region for K' = 3) Equation (8b is true if and only 
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if all the following inequalities are satisfied: 

. WiH (*") +W2 h('^")>0 (9) 



a 2cr/ V a 2a 



u/ ~ a + £ £ \ , u(^ a + A + £ £ 
*wiH , — + W2H , — < (10) 

1 a 2a J V a 2a } ~ 



. 'a-e a-b + A-e\ „ / a - e + A a-b + A-e 

.,,,//[ , J +-2^^ ^ , ^ 

(7 2(7/ \ (7 2(7/ 



/ a b-a+A\ /-a + A b-a + A 

V a 2a J V o- 2<7 

< Wl ff(_Vf)+^(=^,-f) (12) 

V (7 2(7 / V <7 2(7 / 



'6-A 6-a-A + £\ /6-A 6-a-A + e 

•WiH , + u^ij 



2(7 / V cr ' 2(7 

< 6/cr - WiA/a (13) 

f-b-A a-b-A\ ( b a-b-A 

\ a 2cr J \ a' 2cr 

>-b/a-w 1 A/a (14) 

Proof. (Sketch) Let c <0> S T oAe . Note that the fc-means algorithm in one di- 
mension does not change the orders of centers, hence < c^ :1> < cf 3> . By 
the definition of T a ^ to prove that after the first step of fc-means the centers 
c <x> are still in T a j, e we have to check six constraints. Due to space constraints, 
we only show how to prove that the first constraint cj cl> > [i\ — a is equivalent 
to Equation rt9j) . The other conditions can be treated similarly. 

The update step of the fc-means algorithm on the underlying distribution read- 
justs the centers to the actual cluster means: 



r<°>4-^<°> 



1 



c l — r .<0>,<0> / xf(x). 

Loo 2 /(*) 

Thus, cj cl> > /ii — a is equivalent to 



(a;-/*i + o)/(a;) > 0. 

3 (a; - /ii + o)/(x) is n 
[/ii-0,+00). Since c <0> € T a , 6>£ we know that (cf 0> + c^ 0> )/2 > ^-a + e/2 



Moreover, the function /i 1— * f_ (x — /ii + a)f(x) is nondecreasing for /z € 
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and thus the statement Vc <0> G T a ^ t£ ,c\ > [i\ — a is equivalent to 

fill— a+e/2 

(x-m + a)f(z)>0. 
We can now apply Eq. ^ with the following decomposition to get Eq. Q: 

rfJrl— a+e/2 



f (x - Hi + a)f(x) 

J — oo 

//Hi— a+e/2 rfii —a+e/2 

(x — 4- a)Wi,<7 +W2 I (x — fi2 + A + a)(pfj, 2 

-oo J —00 
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A simple symmetry argument allows us to treat the stability of the symmetric 
prism. 

Proposition 4 If T ai b <s is stable for the pdf f{x) = Wx<Pn Xt<T + w-2<p n 2 ,o and 
f{x) — W2f l _ llt a + wiipfj_ 2 , a , then the same holds for sym(T a ^, e ) ■ 

Proof. The fc-means algorithm is invariant with respect to translation of the 
real axis as well as to changes in its orientation. Hence if T a! b. e is stable under 
/ (resp. /), so is sym(T a ^ e ) under f(x) = w 2 tp l _ ll , a + Wi<P&,,o (resp. /). © 



Corollary 5 (Instability for K' = 3) Assume that min.(tt>i, W2) = 0.2 and 
A = 14.5cr. Assume that we have an initialization scheme satisfying: 

• with probability at least (1 — 5)/2 we have 2 initial centers within 2.5a of 
fix and 1 initial center within 2.5a of (12 

• with probability at least (1 — S)/2 we have 1 initial centers within 2.5a of 
fix and 2 initial centers within 2.5a of fj,2 

Then k-means is instable: with probability ( 1 — <5) / 2 it will converge to a solution 
with two centers within 3.5a of /ii and with probability (1 — 5)/2 to a solution 
with two centers within 3.5a of 

Proof. We simply check numerically that for a — 3.5a, b = 2.5a, e = a, 



A = 14.5cr and wi — 0.2 (we also check tu 2 = 0.2) Equations (|9| - (14 1 are 
true. Then by Proposition [3] and Proposition [4] we know that T 3 , 5a ,2.5<j,<j and 
its symmetric sym{T^ c, a ^2.b<y,<y) are stable regions which implies the result. © 
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4 Towards more general results: the geometry 
of the solution space of fc-means 



In the section above we proved by a level set approach that in a very simple 
setting, if we initialize the fc-means algorithm "close enough" to the true clus- 
ter centers, then the initial centers do not move between clusters. However we 
would like to obtain this result in a more general setting. We believe that to 
achieve this goal in a systematic way one has to understand the structure of 
the solution space of fc-means. We identify the solution space with the space 
R dK by representing a set of K' centers c\,...,Ck' € R d as a point c in the 
space R dK . Our goal in this section is to understand the "shape" of the fc- 
means objective function on this space. Secondly, we want to understand how 
the fc-means algorithm operates on this space. That is, what can we say about 
the "trajectory" of the fc-means algorithm from the initial point to the final 
solution? For simplicity, we state some of the results in this section only for the 
case where the data space is one dimensional. They also hold in R d , but are 
more nasty to write up. 

First of all, we want to compute the derivatives of W n with respect to the 
individual centers. 

Proposition 6 (Derivatives of fc-means) Given a finite data setXi, ...,X n e 
R. For k, I € {1, K'} and i € {1, n} consider the hyperplane in R K which 
is defined by 



3. The third derivatives of W n on R K \ H all vanish. 

Proof. First of all, note that the sets H k [ ^ contain the center vectors for which 
there exists a data point Xi which lies on the boundary of two centers c k and 
q. Now let us look at the first derivative. We compute it by foot: 



H k . u := {c e R K> : X t = (c fc + q)/2}. 
Define the set H := U^ i=1 U" =1 Hkj.i- Then we have: 

1. W n is differentiable on R K \ H with partial derivatives 




2. The second partial derivatives of W n on R K \ H are 





dW n (c) 
dc k 



= l im n r(W"( c i> °k) ~ W n (d, c k + h, ...,c K )) 
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When c ^ H we know that no data point lies on the boundary between two 
cluster centers. Thus, if h is small enough, the assignment of data points to 
cluster centers does not change if we replace Ck by Ck + h. With this property, 
the expression above is trivial to compute and yields the first derivative, the 
other derivatives follow similarly. © 



A straightforward consequence is as follows: 

Proposition 7 (fc-means does Newton iterations) The update steps per- 
formed by the k-means algorithms are exactly the same as update steps by a 
Newton optimization. 

Proof. This proposition follows directly from Proposition [6] the definition of 
the Newton iteration on W n and the definition of the fc-means update step. 



This fact has also been stated (less rigorously and without proof) in Bottou and 



Bengio (19951. © 



Together, the two propositions show an interesting picture. We have seen 
in Proposition [6] that the fc-means objective function W n is differentiable on 
R K \ H. This means that the space R K is separated into many cells with hy- 
perplane boundaries H^ t n. By construction, the cells are convex (as they are 
intersections of half-spaces). Our finding means that each data set X\, ...,X n 
induces a partitioning of this solution space into convex cells. To avoid confu- 
sion, at this point we would like to stress again that we are not looking at a 
fixed clustering solution on the data space (which can be described by cells with 
hyperplane boundaries, too), but at the space of all center vectors c. It is easy 
to see that all centers c within one cell correspond to exactly one clustering of 
the data points. As it is well known that the fc-means algorithm never visits 
a clustering twice, we can conclude that each cell is visited at most once by 
the algorithm. Within each cell, W n is quadratic (as the third derivatives van- 
ish). Moreover, we know that fc-means behaves as the Newton iteration. On 
a quadratic function, the Newton optimization jumps in one step to the mini- 
mum of the function. This means that if fc-means enters a cell that contains a 
local optimum of the fc-means objective function, then the next step of fc-means 
jumps to this local optimum and stops. 



Now let us look more closely at the trajectories of the fc-means algorithm. The 

inspired us to derive the following property. 

Proposition 8 (Trajectories of fc-means) Let c <4> and c <t+1> be two con- 
secutive solutions visited by the k-means algorithm. Consider the line connecting 
those two solutions in R K , and let c a = (1 — a)c <t> + ac <t+1> be a point on 
this line (for some a € [0, 1]). Then W n (c a ) < W n {c <l> ). 



paper by Zhang et al. (2008 
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Proof. The following inequalities hold true: 



w n (c a ) = \j2 E \\ x *~ c k\\ 2 

fe=iiec fc (c°) 



K 

I 

< 



lH E n*-«*i ,a 

fe=li6C fe (c«) 

<^E E a|i^-4ii a + (i-«)H^-4 +1 ii a 



fe=l»£C fc (c«) 

< aW^') + (1 - a)IU„(c t+1 ) 

For the first and third inequality we used the fact that assigning points in Cj~ (c) 
to the center Ck is the best thing to do to minimize W n . For the second inequal- 
ity we used that x — > ||a;|| 2 is convex. The proof is concluded by noting that 
W„(e <4> ) < W n {c< t+1 >). © 

We believe that the properties of the fc-means objective function and the al- 
gorithm are the key to prove more general stability results. However, there is 
still an important piece missing, as we are going to explain now. Since fc-means 
performs Newton iterations on W n , one could expect to get information on the 
trajectories in the configuration space by using a Taylor expansion of W n . How- 
ever, as we have seen above, each step of the fc-means algorithm crosses one of 
the hyperplanes Hk.i,i on which W n is non-differentiable. Hence, a direct Taylor 
expansion approach on W n cannot work. On the other hand, surprisingly one 
can prove that the limit objective function W := lim^W n is almost surely a 
continuously differentiablc function on R K (we omit the proof in this paper). 
Thus one may hope that one could first study the behavior of the algorithm for 
W, and then apply concentration inequalities to carry over the results to W n . 
Unfortunately, here we face another problem: one can prove that in the limit 
case, a step of the fc-means algorithm is not a Newton iteration on W. 

Proposition [8] directly evokes a scheme to design stable regions. Assume that 
we can find two regions A C B C R K of full rank and such that 

max W n (x) < min W n (x). (15) 

xedA xedB 

Then, if we initialize in A we know that we will converge to a configuration 
in B. This approach sounds very promising. However, we found that it was 



impossible to satisfy both Equation (151 and the constraint that A has to be 



"big enough" so that we initialize in A with high probability. 

Finally, we would like to elaborate on a few more complications towards more 
general results: 
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• On a high level, we want to prove that if K' is slightly larger than the true 
K, then fc-means is instable. On the other hand, if K' gets close to the 
number n of data points, we trivially have stability again. Hence, there is 
some kind of "turning point" where the algorithm is most instable. It will 
be quite a challenge to work out how to determine this turning point. 

• Moreover, even if we have so many data points that the above problem is 
unlikely to occur, our analysis breaks down if K' gets too large. The reason 
is that if K' is much bigger than K, then we cannot guarantee any more 
that initial centers will be in stable regions. Just the opposite will happen: 
at some point we will have outliers as initial centers, and then the behavior 
of the algorithm becomes rather unpredictable. 

• Finally, consider the case of K' < K . As we have already mentioned in the 
introduction, in this case it is not necessarily the case that different initial 
configurations lead to different clusterings. Hence, a general statement on 
(in)stability is not possible in this case. This also means that the tempting 
conjecture "the true K has minimal stability" is not necessarily true. 



5 An initialization algorithm and its analysis 



We have seen that one can prove results on clustering stability for fc-means if 
we use a " good" initialization scheme which tends to place initial centers in dif- 
ferent Gaussians. We now show that an established initialization algorithm, the 
Pruned MinDiam initialization described in Figure [2] has this property, i.e it 
has the effect of placing the initial centroids in disjoint, bounded neighborhoods 
of the means Hi-k- This often rediscovered algorithm is credited to |Hochbaum| 
and Shmoys (19851. In Dasgupta and Schulman (20071 it was analyzed it in the 
context of the EM algorithm. Later Srebro et al. ( 2006 ) used it in experimental 



evaluations of EM, and it was found to have a significant advantage w.r.t more 
naive initialization methods in some cases. While this and other initializations 
have been extensively studied in conjunction with EM, we are not aware of any 
studies of Pruned MinDiam for fc-means. 

We make three necessary conceptual assumptions. Firstly to ensure that K is 
well-defined we assume that the mixture weights are bounded below by a known 
weight w min . 

Assumption 1 Wk > w m in for all k. 

We also require to know a lower bound A and an upper bound A max on the sep- 
aration between two Gaussians, and we assume that these separations are "suf- 
ficiently large". In addition, later we shall make several technical assumptions 
related to a parameter r used in the proofs, which also amount to conditions 
on the separation. These assumptions shall be made precise later. 

Theorem 9 (Pruned MinDiam Initialization) Let f = J2i w kf^ k ,i be a 
mixture of K Gaussians with centers (J-i-.k, Mfc < (J-k+i, arid unit variance. Let 

T £z (0,0.5), S m i ss ^> 0, ^impure 



defined in Proposition 12 Lf we run Algorithm 
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Algorithm Pruned MinDiam 
Input: Wmin, number of centers K' 

1. Initialize with L random points cf.^, L computed by (191 

2. Run one step of fc-means, that is 

(a) To each center cf 0> assign region C°, j = 1 : L 

(b) Calculate cf.]^ as the centers of mass of regions C°. L 



3. Remove all centers cf for which P[Cj] < po, where po is given by (19 1. We 
are left with c^ x> , j' = 1 : L' . 

4. Choose K' of the remaining centers by the MinDiam heuristic 

(a) Select one center at random. 

(b) Repeat until K' centroids are selected: 

Select the centroid cf 1:> that maximizes the minimum distance to the 
already selected centroids. 

Output: the K' selected centroids c^ 1> , k — 1 : K' 



Figure 2: The Pruned MinDiam initialization 

Pruned MinDiam with any 2 < K' < l/w m i n , then, subject to Assumptions 
1, 2, 3, 4, 5 (specified later), with probability 1 — 25 m i SS — Si mpure over the 
initialization there exist K disjoint intervals A^, specified in Section 5.4 one 



for each true mean Hk, so that all K' centers c^, 1> are contained in {J k A^ and 



if K' — K, each A^ will contain exactly one ce nter c^, (16) 
if K' < K, each A^ will contain at most one ce nterc<}>, (17) 
if K > K, each A^ will contain at least one center c^. (18) 

The idea to prove this result is to show that the following statements hold 
with high probability. By selecting L preliminary centers in step [l] of Pruned 
MinDiam, each of the Gaussians obtains at least one center ( Section |5.1[ ). After 



steps 2a 2b we obtain "large" clusters (mass > po) and "small" ones (mass 
< po). A cluster can also be "pure" (respectively "impure") if most of its 
mass comes from a single Gaussian (respectively from several Gaussians) . Step 
[3] removes all "small" cluster centers, but (and this is a crucial step of our 



argument) w.h.p it will also remove all "impure" cluster centers (Section 5.2 1 
The remaining clusters are "pure" and "large" ; we show (Section |5.3[ ) that each 
of their centers is reasonably close to some Gaussian mean Hence, if the 
Gaussians are well separated, the selection of final centers c^ x> in step [i] "cycles 
through different Gaussians" before visiting a particular Gaussian for the second 



time (Section 5.4 1. The rest of this section outlines these steps in more details. 
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5.1 Step [T] of Pruned MinDiam. Picking the initial cen- 
troids c <0> 

We need to pick a number of initial centers L large enough that each Gaussian 
has at least 1 center w.h.p. We formalize this here and find a value for L that 
ensures the probability of this event is at least 1— 5 m i SS , where 5 m i SS is a tolerance 
of our choice. Another event that must be avoided for a "good" initialization 
is that all centroids c^ 0> belonging to a Gaussian end up with initial clusters 
C® that have probability less than pq. If this happens, then after thresholding, 
the respective Gaussian is left with no representative centroid, i.e it is "missed". 
We set the tolerance for this event to Sthresh = 5 m i SS . Let t — 2<i>(— A/2) the 
tail probability of a cluster and Ak the symmetric neighborhood of that has 

Proposition 10 If we choose 

L > (in- 1 ] / ( (1 - t)w min ) and p = — (19) 



' miss wmin 



eL 



then the probability over all random samplings of centroids cji^ that at least one 
centroid c^ 0> with assigned mass P[Cj] > po can be found in each Ak, k = 1 : K, 
is greater or equal to 1 — 2<5„ 



J rniss • 



The proof of this result is complicated but standard fare (e.g. Chernoff bounds) 
and is therefore omitted. 



After steps |T)[2a| and 2b of Pruned MinDiam are performed, we obtain centers 



cf.^ situated at the centers of mass of their respective clusters C\. L . Removing 
the centers of small clusters follows. We now describe a beneficial effect of this 
step. 

5.2 Step [3] of Pruned MinDiam. Thresholding removes 
impure clusters 

We introduce the concept of purity of a cluster, which is related to the ratio of 
points from a certain Gaussian w.r.t to the total probability mass of the cluster. 
Denote Pk the probability distribution induced by the k-th Gaussian (/?^ fc ,i- 

Definition 11 A cluster C is (1 — r)-pure if most of its points come from a 
single Gaussian, i.e if WkPk[C] > (1 — r)P[C], with t < 1/2 being a positive 
constant. A cluster which is not (1 — r)-pure is r-impure (or simply impure/ 

The values of r that we consider useful are of the order 0.001 — 0.02 and, as it 
will appear shortly, r < w min /2. The purity of a cluster helps in the following 
way: if a cluster is pure, then it can be "tied" to one of the Gaussians. Moreover, 
its properties (like center of mass) will be dictated by the Gaussian to which it 



is tied, with the other Gaussians' influence being limited; Section 5.3 exploits 
this idea. 
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But there will also be clusters that are impure, and so they cannot be tied to 
any Gaussian. Their properties will be harder to analyze, and one expects their 
behavior to be less predictable. Luckily, impure clusters are very likely small. 
As we show now, the chance of having an impure cluster with mass larger than 
Po is bounded by a 5i mpure which we are willing to tolerate. 
Because of limited space, we leave out the long and complex rigourous proofs 
of this result, and give here just the main ideas. Let Cj = [zi, z 2 ] be a r-impure 
cluster, with P[Cj] > po, Cj the centroid that generates Cj (not necessarily at 
its center of mass) and Cj_ i,Cj+i the centroids of the adjacent clusters (not 
necessarily centers of mass). As one can show, even though an impure cluster 
contains some probability mass from each Gaussian, in most of this section we 
only need consider the two Gaussians which are direct neighbors of C. Let us 
denote the parameters of these (consecutive) Gaussians by ^1,2, wi, 2- 
For the purpose of the proof, we are looking here at the situation after step [2a| 
thus the centroids Cj-i.j J+ i should be c^j^ but we renounce this conven- 
tion temporarily to keep the notation light. We want to bound the probability 
of cluster Cj being impure and large. Note that Step [2b] of the Pruned Min- 
DlAM does not affect either of these properties, as it only acts on the centers. 
A simple observation is the following. Since z\ = Cj ~^ +Cj and z 2 = Cj+ ^ +C3 
we have c 3+ i — Cj-\ = 2(z 2 — Z\) = 2Az. The idea is to show that if 
an impure region has probability larger than p , then the interval [cj-i, Cj+i] 
has probability at least pi, significantly larger than pq. On the other hand, the 
probability of sampling from P a single center Cj out of a total of L in an interval 
of length 2Az is P[c,-_i, c i+i ](l - P[c j - 1 ,c J+1 ]) L - 1 < (1 - pi)^ 1 . If pi and 

L are large enough, then (1 — pi) L_1 = Si, npure will be vanishingly small. We 
proceed in two steps: first we find the minimum length Az of a cluster Cj which 
is impure and large. Then, we find a lower bound p\ on the probabability of any 
interval [c, c+2Azo] under the mixture distribution. The following assumption 
ensures that the purity 1 — r is attainable in each Gaussian. 

Assumption 2 Let jk.k'ix) — wi^ k 'i(x) ( a local purity measure). Then 

1 , (l-r)po 



The next assumption ensures that Azo > 0, i.e it is an informative bound. 
Assumption 3 d(^^j < |A. 

Proposition 12 (Impure clusters are small w.h.p) Leiwi,u> 2 be the mix- 
ture weights of two consecutive Gaussians and define Az = A — d I ^ 



' ( TPo\ 

\W2 J' 



/'A-2Az \ /A-2Az ln 

pi = wi$ - — + 



Wl 



A-2Az o y V 2 A-2Az 0/ 
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Figure 3: Concrete example of a large 
impure cluster [zl,z2]; cl,c, c2 rep- 
resent the consecutive cluster centers 
cf2i , cf 0> , cf^ . We demonstrate that 
if P[zl,z2] > po then the interval 
[cl,c2] (which is twice its length) must 
have mass > pi >> po- If L is 
large enough, having such a large in- 
terval contain a single Cj is improbable. 
Numerical values: mixture with A = 
W,w m m = 0.15, impurity r([zi,z 2 ]) 
().()?. P[zl,z2] = 0.097, P[cl,c2] = 0.24; 
using 5 misa = 0.02, r = 0.015 one gets 
L = 38, po = 0.095 < P[zl,z2],p! = 

e = 0.016 >> 



0.0105 < P[cl,c2], Si, 
(1-P[cl,c2][ 



= 0.00003 



and 5i mpure — (1 — Pi) L 1 ■ Let C®,j = 1,...,L be the regions associated with 
c i°l~ a ft er ste P ^ a °f the Pruned MinDiam algorithm. If assumptions l\S\3 



hold, then the probability that there exists j G {1, . . . , L} so that P[Cj] > po and 
wiP\[Cj] > tP[Cj], W2P2[Cj] > tP[Cj] is at most Si mpure . This probability is 
over the random initialization of the centroids cji? > . 

To apply this proposition without knowing the values of wi,W2 one needs to 
minimize the bound p\ over the range w\, W2 > w m i n , W2+W1 < \ — {K—2)w m i n . 
This minimum can be obtained numerically if the other quantities are known. 
We also stress that because of the two-step approach, first minimizing Az , then 
P[c, c+2Azq], the bound Si mpure obtained is not tight and could be significantly 
improved. 



5.3 The (1 — r)-pure cluster 

Now we focus on the clusters that have P[C] > po and are (1 — r)-pure. By 
Proposition [12] w.h.p their centroids are the only ones which survive the thresh- 
olding in step [3] of the Pruned MinDiam algorithm. In this section we will 
find bounds on the distance |c^ :1> — Hh\ between Cj's center of mass and the 
mean of "its" Gaussian. 

We start by listing some useful properties of the standard Gaussian. Denote by 
r(x) the center of mass of [x, oo) under the truncated standard Gaussian, and by 
d(t) the solution of 1 — &(d) — t, with < t < 1. Intuitively, d(t) is the cutoff 
location for a tail probability of t. Note that any interval whose probability 
under the standard normal exceeds t must intersect [— d(t), d(t)]. Let a > (in 
the following a as to be thought as a small positive constant). 
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Proposition 13 (i) r{x) is convex, positive and increasing for x > (ii) For 
w € [2a, oo ) the function d(a/w) is convex, positive and increasing w.r.t w, and 
r{d{a/w)) is also convex, positive and increasing. 

Proposition 14 Let C = [zi,Z 2 ] be an interval (with z\,z 2 possibly infinite), c 
its center of mass under the normal distribution ip/ii and P[C] its probability 
under the same distribution. If 1/2 > P[C] > p, then \c — fi\ < r(d(p)) and 
min{|2i -n\, \z 2 < d(p) = -$" 1 (p). 

The proofs are straightforward and omitted. Define now w max = 1— [K — \)w m i n 
the maximum possible cluster size in the mixture and 



R{w) 



(1 " r) Pc 



R(wi,w 2 ) = 



(1 — t)w 2 Wi 



In the next proposition, we will want to assume that R > 0. The following 
assumption is sufficient for this purpose. 

Assumption 4 ^— < § - $(-A/2) 

Proposition 15 (The (1 — r)-pure cluster) Let cluster C = [z\,z 2 \ with z 2 > 
fi k , P[C] > Po and w k P k [C] > (1— r)P[C] for some k, with r satisfying Assump- 
tions^and^ Let c,c k denote the center of mass of C under P,Pk respectively. 
Then 

|c fe -/ife| < R(w k ) (20) 

and, whenever k < K 

Z 2 - H k <-R(w k .W k +l) < -R(w m ax,W min ) (21) 

Proposition 16 (Corollary) If c k > [i k and k < K then 

c- Mfe < (1 - T)R(w k ) + r(A - R(w k ,w k+1 )) (22) 

< (1-t)R(w mti.r 

) + r(A - R(w in a x •■ Wmin)) (23) 

< (l-T)R(w max )+TA (24) 

else 

Mfe-c<-R(w fc ) < R(w max ) c- fl k < r(A - R(w k ,w k+1 )) (25) 

By symmetry, a similar statement involving fi k -i, w k -±, fi k , w k and c holds when 
z i > Mfe is replaced by Z\ < \i k . With it we have essentially shown that an almost 
pure cluster which is not small cannot be too far from its Gaussian center \i k . 



Proof of Proposition 15 ( 201) follows from Proposition 14 Now for bounding 



z 2 , in the case k < K. Because (1 — t)P[C] < w k (the contribution of Gaussian 

k to cluster C cannot exceed all of w k ) we have P k +i [CI < rP ^ < T f k 

and P k+ i[C] = <S>(z 2 - Mfe+i) - ®( Z X-. Mfc+i) > *( z 2 - Mfe+i) - *(ci - Mfc+i) 



from which the first inequality in ( |2l| follows. The function i? is increasing 
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with w k when Wk+i constant or w k +i = constant — wi, which gives the second 
bound. □ 
Proof of the corollary First note that we can safely assume z\ > \i k . If the 
result holds for this case, then it is easy to see that having Z\ < (if- only brings 
the center of mass c closer to ji k . 



w k Pk[C]c k 



P[C] 



< (1 - r)c fe + tz 2 



(26) 



Now ( 22|23 I follow from Proposition 15 For (24) Assumption [4] assures that 
R > 0. As a consequence, this bound is convex in w k - If fc = 1 and c\ < /ii, 
or k = K and ck > (J>k then the second term in the sum ( |26[ ) pulls c\ in the 
direction of [i\ (respectively ck in the direction of [Ik) and we can get the 
tighter bounds ( p5| . □ 
In conclusion, we have shown now that if the unpruned center c "belongs" to 
Gaussian fc, then 



c e A k (w k ) = [fi k - R T (w k ), fi k + R+(w k )} 

whith R~(w k ) = (1 - r)R(w k ) + r(/i fc - fik-i), Rti^k) = (1 - r)R(w k ) + 
- Hk), R7( w i) = R( w i), and R+(w K ) = R(w K ). 



5.4 Step [i] of Pruned MinDiam. Selecting the centers by 
the MinDiam heuristic 

From Section |5.2| we know that w.h.p all centroids unpruned at this stage are 
(1 — t) pure. We want to ensure that after the selection in step [4] each Gaussian 
has at least one c^ 1> near its center. For this, it is sufficient that the regions 
Ak are disjoint, i.e 

(Hk+l ~ Mfc) - {Rt( W k) + Rr( w k+l)) > Rr( w k) + R$ {™k) 

(fik+i - V>k) - (Rt(wk) + R~(w k+ i)) > R~(w k+ i) + Rt(w k+ i) 

for all fc. Replacing R^(w k ) with their definitions and optimizing over all pos- 
sible wi-.K > Wmin and for all A/i < y, k +i — Hk < A max produces 

A k = \fi k ± (1 - T)R(w max ) ± TA max ] 

and 

Assumptions (1 - 3t)A - TA max > [3R(w max ) + R(w min )](l - r). 



6 Simulations 

In this section we test our conjecture in practice and run some simulations to 
emphasize the different theoretical results of the previous sections. We also 
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Figure 4: Simulation results. First row: data set "two dim four balanced clus- 
ters" . Second row: data set "two dim four imbalanced clusters" . Third row: 
data set "ten dim ten clusters" (see text for details) . 

investigate whether it is necessary to look at the stability of fc-means with re- 
spect to the random drawing of the data set. In the following when we refer to 
randomization we mean with respect to the initialization while the resampling 
corresponds to the random drawing of the data set. 

Setup of the experiments. As distributions we consider mixtures of Gaus- 
sians in one, two, and ten dimensions. Each mixture consists of several, reason- 
ably well separated clusters. We report the results on three such data sets: 

• "Two dim four balanced clusters" : Mixture of four Gaussians in R 2 with 
means (—3.3), (0,0), (3,3), (3,-3); the covariance matrix of all clusters is 
diagonal with entries 0.2 and 1 on the diagonal; the mixing coefficients are 
uniform, that is all clusters have the same weight. 

• "Two dim four imbalanced clusters" : As above, but with mixing coefficients 
0.1,0.5,0.3,0.1. 

• "Ten dim ten clusters" : Mixture often Gaussians in R 10 with means (i, 0, 0, ...) 
for i = 1, 10. All Gaussians are spherical with variance 0.05 and mixing 
coefficients are uniform. 

As clustering algorithm we use the standard fc-means algorithm with the follow- 
ing initializations: 

• Standard initialization: randomly pick K' data points. 

• MinDiam initialization, coincides with Step 5 in Fig. [2] 

• Pruned MlNDlAMinitialization, as analyzed in Section [5] (see Fig. [2] 

• Deterministic initialization: K' fixed points sampled from the distribution. 
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For a range of parameters K' € {2, 10} we compute the clustering stability 
by the following protocols: 

• Randomization, no resampling: We draw once a data set of n = 100 points 
from the distribution. Then we run the /c-means algorithm with different 
initializations. 

• Resampling, no randomization: We fix a set of deterministic starting points 
(by drawing them once from the underlying distribution) . Then we draw 100 
data sets of size n = 100 from the underlying distribution, and run fc-means 
with the deterministic starting points on these data sets. 

• Resampling and randomization: we combine the two previous approaches. 
Then we compute the stability with respect to the minimal matching distance 
between the clusters. Each experiment was repeated 100 times, we always re- 
port the mean values over those repetitions. 



Note that all experiments were also conducted with different data set sizes 
(n = 50, 100, 500), stability was computed with and without normalization (we 
used the normalization suggested in Lange et al. 2004), and the /c-means algo- 
rithm was used with and without restarts. All those variations did not signifi- 
cantly effect the outcome, hence we omit the plots. 



Results. First we evaluate the effect of the different initializations. To this 
end, we count how many initializations were "good initializations" in the sense 
that each true cluster contains at least one initial center. In all experiments we 
consistently observe that both the pruned and non-pruned min diameter heuris- 
tic already achieve many good runs if K' coincides with K or is only slightly 
larger than the true K (of course, good runs cannot occur for K 1 < K). The 
standard random initialization does not achieve the same performance. See Fig- 
ure [4] first column. 

Second, we record how often it was the case that initial cluster centers cross 
cluster borders. We can see in Figure [4] (second column) that this behavior 
is strongly correlated with the number of "good initializations" . Namely, for 
initialization methods which achieve a high number of good initializations the 
fraction of centers which cross cluster borders is very low. Moreover, one can 
see in the third column of Figure [4] that centers usually do not cross cluster 
borders if the initialization was a good one. This coincides with our theoretical 
results. 



Finally, we compare the different protocols for computing the stability: using 
randomization but no resampling, using resampling but no randomization, and 
using both randomization and resampling, cf. right most plots in Figure [4] In 
simple data sets, all three protocols have very similar performance, see for exam- 
ple the first row in Figure [4] That is, the stability values computed on the basis 
of resampling behave very similar to the ones computed on the basis of ran- 
domization, and all three methods clearly detect the correct number of clusters. 
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Combining randomization and resampling does not give any advantage. How- 
ever, on the more difficult data sets (the imbalanced one and the 10-dimcnsional 
one), we can see that resampling without randomization performs worse than 
the two protocols with randomization (second and third row of Figure [5]) . While 
the two protocols using randomization have a clear minimum around the correct 
number of clusters, stability based on resampling alone fails to achieve this. We 
never observed the opposite effect in any of our simulations (we ran many more 
experiments than reported in this paper). This shows, as we had hoped, that 
randomization plays an important role for clustering stability, and in certain 
settings can achieve better results than resampling alone. 

Finally, in the experiments above we ran the fc-means algorithm in two modes: 
with restarts, where the algorithm is started 50 times and only the best solution 
is kept; and without restarts. The results did not differ much (above we report 
the results without restarts). This means that in practice, for stability based 
parameter selection one can save computing time by simply running fc-means 
without restarting it many times (as is usually done in practice). From our 
theory we had even expected that running fc-means without restarts achieves 
better results than with restarts. We thought that many restarts diminish the 
effect of exploring local optima, and thus induce more stability than "is there" . 
But the experiments did not corroborate this intuition. 



7 Conclusions and outlook 

Previous theoretical work on model selection based on the stability of the fc- 
means algorithm has assumed an "ideal fc-means algorithm" which always ends 
in the global optimum of the objective function. The focus was to explain how 
the random drawing of sample points influences the positions of the final centers 
and thus the stability of the clustering. This analysis explicitly excluded the 
question when and how the fc-means algorithm ends in different local optima. 
In particular, this means that these results only have a limited relevance for the 
actual fc-means algorithm as used in practice. 

In this paper we study the actual fc-means algorithm. We have shown that the 
initialization strongly influences the fc-means clustering results. We also show 
that if one uses a "good" initialization scheme, then the fc-means algorithm is 
stable if it is initialized with the correct number of centers, and instable if it 
is initialized with too many centers. Even though we have only proved these 
results in a simple setting so far, we are convinced that the same mechanism 
also holds in a more general setting. 



These results are a first step towards explaining why the selection of the number 
of clusters based on clustering stability is so successful in practice Lange et al. 
(20041. From this practical point of view, our results suggest that introducing 
randomness by the initialization may be sufficient for an effective model selec- 



21 



tion algorithm. Another aspect highlighted by this work is that the situations 
K' < K and K' > K may represent two distinct regimes for clustering, that 
require separate concepts and methods to be analyzed. 

The main conceptual insight in the first part of the paper is the configura- 
tions idea described in the beginning. With this idea we indirectly characterize 
the "regions of attraction" of different local optima of the fc-means objective 
function. To our knowledge, this is the first such characterization in the vast 
literature of k- means. 



In the second part of the paper we study an initialization scheme for the fc-means 
algorithm. Our intention is not to come up with a new scheme, but to show 
that a scheme already in use is "good" in the sense that it tends to put initial 
centers in different clusters. It is important to realize that such a property does 
not hold for the widely used uniform random initialization. 
On the technical side, most of the proofs and proof ideas in this section are 



novel. In very broad terms, our analysis is reminiscent to that of Dasgupta and 



Schulman ( 2007 1 . One reason we needed new proof techniques lie partly in the 



fact that we analyze one-dimensional Gaussians, whose concentration properties 
differ qualitatively from those of high dimensional Gaussians. We loose some 
of the advantages high dimensionality confers. A second major difference is 
that fc-means behaves qualitatively differently from EM whenever more than 
one Gaussian is involved. While EM weights a point "belonging" to a cluster 
by its distance to the cluster center, to the effect that far away points have a 
vanishing influence on a center Cj , this is not true for fc-means. A far-away point 
can have a significative influence on the center of mass Cj , precisely because of 
the leverage given by the large distance. In this sense, fc-means is a more brittle 
algorithm than EM, is less predictible and harder to analyze. In order to deal 
with this problem we "eliminated" impure clusters in Section |5.2| Third, while 



Dasgupta and Schulman (20071 is concerned with finding the correct centers 



when K is known, our analysis carries over to the regime when K' is too large, 
which is qualitatively very different of the former. 

Of course many initialization schemes have been suggested and analyzed in the 



literature for fc-means (for examples see Ostrovsky et al. 2006 Arthur and Vas- 
silvitskii 20071. However, these papers analyze the clustering cost obtained 



with their initialization, not the positions of the initial centers. 
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